ABSTRACT

Classification of Quranic verses into predefined categories is an essential task in Quranic studies. In recent times, with the advancement in information technology and machine learning, several classification algorithms have been developed for text classification tasks. Automated text classification (ATC) is a well-known technique in machine learning. It is the task of developing models that can be trained to automatically assign to each text instance a known label from a predefined set. In this paper, four conventional ML classifiers, support vector machine (SVM), naive Bayes (NB), decision trees (J48), and nearest neighbor (k-NN), are used to classify selected Quranic verses into three predefined class labels: faith (iman), worship (ibadah), and etiquettes (akhlak). The Quranic data comprise verses in chapter two (al-Baqara) of the holy scripture. In the results, the classifiers achieved above 80% accuracy, with the naive Bayes (NB) algorithm recording the overall highest scores of 93.9% accuracy and 0.964 AUC.

Keywords: Holy Quran, Machine learning, Text classification, Feature selection

1. INTRODUCTION

Machine learning is an important and recognized field in information technology as well as artificial
intelligence. As technology advances, it becomes increasingly necessary for AI systems to make decisions
automatically and independently within a given time frame. To achieve this, computers need to learn
without being explicitly programmed [1].

The field of machine learning (ML) studies how to give an AI system the capability to improve its
performance (decision making) over time by acquiring new knowledge and skills (learning/training), and
to reorganize its existing knowledge in light of the newly acquired knowledge [2]. Thus, what clearly
differentiates an intelligent AI system from other computing systems is its ability to learn and make
decisions.

The basic goal of ML is to model machines for critical decision-making purposes. One of the most
important and widely studied techniques in machine learning is classification [3], which is the problem of
identifying to which of a set of categories a new observation belongs, on the basis of a training set of data
containing observations whose category membership is known [4]. A frequently applied area of data
classification is text classification (also referred to as text categorization). It is the task of automatically
sorting a set of documents into categories from a predefined set [3, 5-8].

The Quran is the holy book of the Muslim faithful [1, 9-10]. It is one of the most widely read and
referenced resources. Several features of the Quran make automated processing of its textual data an
attractive task in ML. These features include the arrangement of words in the Quran and the grouping of
words into verses, verses into chapters, and chapters into juz.

For many years, Quranic scholars have devoted much attention and effort to producing classical
works, among which are Quran commentaries, the science of hadith (prophetic sayings), and grammar
(Nahw). However, in recent times, with the advancements in information technology and machine learning,
automating the processing of Quranic verses (and other related works) for the purpose of knowledge
discovery has become a necessity.

Research on automating the processing of the Quranic text has gained attention in recent times.
Existing works in the literature include text classification applications on the Holy Quran [1, 3, 11-13],
ontology-based applications [14-17], and digitized Holy Quran applications [18-22]. Furthermore,
conventional machine learning algorithms often implemented in ML tasks include naive Bayes (NB) [4],
decision trees (J48) [23], neural networks [24], support vector machines (SVM) [25], and k-nearest
neighbour (k-NN) [26].

An exhaustive review of the existing works in Quranic text classification showed that the Quranic
data used in the experiments were drawn from individual Quranic sources. This study holds that combining
multiple related data sources, such as the Quranic translation and commentary (tafsir), could provide more
relevant information. Furthermore, most of the existing works are based on Arabic, which is the primary
language of the Quran. However, only about 15% of the world’s Muslim population [27] are Arabs or Arabic
speaking. Thus, there is a need to extend Quranic text classification to other languages, most importantly
English, which is arguably one of the most spoken languages in the world.

This paper presents the automated labeling of Quranic verses using a machine learning approach.
In this work, standard machine learning algorithms are applied to the labeling task. The study employed
four ML classification algorithms (or classifiers): SVM, NB, J48, and k-NN. Section 2 documents the
methodology employed in executing the classification task.


2. METHODS AND MATERIALS 

The experimental work comprises five phases, as shown in Figure 1: data gathering, feature
generation, feature selection, classification, and output/result, with feature generation and feature
selection together forming the preprocessing step. The input data are Quranic verses extracted from the
combined sources of Holy Quran translation and tafsir. The study identified the significance of combining
multiple related Quranic sources [1, 11] for better understanding of the input verses.


[Flowchart: Input data (Quranic verses) -> Feature generation -> Feature selection -> Prediction -> Output labels]

Figure 1. Experimental steps


2.1. Data Gathering 

The experimental datasets (QTrans, QTaf, QTrans+Taf) comprise 286 instances of Quranic data.
The Quranic text consists of the words of Allah in surah al-Baqara (The Cow) of the holy book. The scripture
comprises 114 chapters (varying in size and order of revelation) in its entirety. Al-Baqara, revealed in
Madinah, is the longest chapter (also called surah) in the Quran, with a total of 286 verses. Each input verse
is assigned one of three predefined labels: faith, worship, or etiquettes. These class labels are drawn from
the most fundamental aspects of Islam [1, 3].


2.2. Text Preprocessing 


Preprocessing is an important step employed when classifying textual data [1, 4]. The step includes
feature generation, transformation, and data cleansing.

Indonesian J Elec Eng & Comp Sci, Vol. 10, No. 1, April 2018 : 925-931

Firstly, features are extracted from the Quranic sources using the standard String to Word Vector filter
tool [1]. The TF-IDF weighting method is further applied to assess and measure the degree of relevance of
the extracted features. Term frequency tf(t,d), as shown in equation 1, measures the frequency of words in
a document d [11].


$$\mathrm{tf}(t,d) = 0.5 + \frac{0.5 \times f(t,d)}{\text{maximum occurrences of words}} \qquad (1)$$


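In addition, inverse document frequency (IDF) is a method that helps to evaluate how relevant a word is to
the document. It is given as:

$$\mathrm{idf}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (2)$$

where N is the number of documents in the collection D and the denominator is the number of documents
that contain the term t. The combination, TF-IDF, is given as:

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D) \qquad (3)$$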
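
The weighting in equations (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that "maximum occurrences of words" means the count of the most frequent word in the document, that IDF takes the standard form log(N / number of documents containing the term), and the toy corpus and function names are hypothetical rather than taken from the paper's WEKA setup.

```python
import math
from collections import Counter


def augmented_tf(term, doc_tokens):
    """Equation (1): 0.5 + 0.5 * f(t, d) / (count of the most frequent word in d)."""
    counts = Counter(doc_tokens)
    max_count = max(counts.values())  # assumes a non-empty document
    return 0.5 + 0.5 * counts[term] / max_count


def idf(term, corpus):
    """Equation (2): log(N / number of documents that contain the term)."""
    n_docs = len(corpus)
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / containing)  # assumes the term occurs somewhere in the corpus


def tfidf(term, doc_tokens, corpus):
    """Equation (3): product of the term-frequency and IDF weights."""
    return augmented_tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus of tokenized texts (not the paper's data).
corpus = [
    ["believe", "unseen", "prayer"],
    ["prayer", "charity", "prayer"],
    ["kindness", "parents"],
]
score = tfidf("prayer", corpus[1], corpus)
print(round(score, 4))
```

A word that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the weighted representation regardless of how often it occurs.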


2.2.1 Feature Selection 

The generated features often come with the problem of high dimensionality. The curse of
dimensionality is a known problem usually associated with textual data [11]. High dimensional data usually
influence the classifiers’ decisions negatively, resulting in lower classification accuracy [1]. Dimensionality
reduction methods such as feature selection are commonly employed to reduce the curse of dimensionality.
There are two possible approaches to feature selection: the ranking features approach and the subset
selection approach [1].

The ranking features approach ranks features according to a criterion defined by the feature
selection algorithm and selects the top k features. The subset selection approach, on the other hand, selects
a minimum subset of features without deterioration in learning performance [1].

In this work, the information gain (IG) and chisquare (CH) FS algorithms are employed for
dimensionality reduction. InfoGain is one of the most widely applied feature selection algorithms [1].
This filter-based algorithm measures the dependency between features and labels [1]. Mathematically,
IG is given as:


$$I(X;Y) = H(X) - H(X \mid Y) \qquad (4)$$
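
As a minimal sketch of equation (4), the following Python snippet estimates H(X) and H(X | Y) from empirical counts and returns their difference. The function names and the toy presence/absence feature are illustrative assumptions, not the paper's WEKA configuration.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy H(X) in bits."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): entropy of x within each group of y, weighted by the group's share."""
    total = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(x, y):
    """Equation (4): I(X;Y) = H(X) - H(X | Y)."""
    return entropy(x) - conditional_entropy(x, y)


# Hypothetical toy data: 1 means the word occurs in the verse, 0 means it does not.
labels = ["faith", "faith", "worship", "worship"]
informative = [1, 1, 0, 0]    # perfectly aligned with the labels
uninformative = [1, 0, 1, 0]  # unrelated to the labels

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

Ranking all features by this score and keeping the top k corresponds to the ranking features approach described earlier.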


The chisquare filter FS algorithm is used as a test of independence to assess whether a particular feature is
independent of the class label [1]. Given a feature with r different values and c classes, the chisquare
feature score can be defined as:


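$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - \mu_{ij})^2}{\mu_{ij}} \qquad (5)$$

where n_{ij} is the number of samples with the i-th feature value in class j, and \mu_{ij} is the count
expected if the feature and the class label were independent.
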
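
The chisquare score can be illustrated with a short Python sketch. It assumes that equation (5) denotes the standard Pearson chi-square statistic over an r-by-c contingency table of feature values against class labels; the function name and toy data are hypothetical.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one discrete feature and the class labels."""
    n = len(labels)
    row_totals = Counter(feature_values)             # samples per feature value
    col_totals = Counter(labels)                     # samples per class
    observed = Counter(zip(feature_values, labels))  # joint counts n_ij
    score = 0.0
    for value, row_n in row_totals.items():
        for label, col_n in col_totals.items():
            expected = row_n * col_n / n             # count expected under independence
            score += (observed[(value, label)] - expected) ** 2 / expected
    return score


# Hypothetical toy data: 1 means the word occurs in the verse, 0 means it does not.
labels = ["faith", "faith", "worship", "worship"]
print(chi_square_score([1, 1, 0, 0], labels))  # feature aligned with the labels
print(chi_square_score([1, 0, 1, 0], labels))  # feature unrelated to the labels
```

A larger score indicates stronger evidence against independence, so features with higher scores are ranked as more relevant to the class label.
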
2.3. Data Classification 

The ultimate aim of this study is to automate the labeling of Quranic verses using a machine
learning method. To achieve this, four ML classifiers, SVM, NB, J48, and k-NN, are implemented using the
conventional 10-fold cross-validation method. The classifiers are trained to predict/classify Quranic text
instances into predefined labels.
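
The 10-fold cross-validation procedure can be sketched as follows. This is a plain, unstratified split written for illustration; the helper name, the shuffling seed, and the round-robin fold assignment are assumptions, and the splitting performed by WEKA in the paper's experiments may differ in detail.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# 286 instances, matching the number of verses in the datasets.
splits = list(k_fold_indices(286, k=10))
for train_idx, test_idx in splits:
    # A classifier would be trained on train_idx and evaluated on test_idx here.
    assert len(train_idx) + len(test_idx) == 286
```

The reported score is then the average of the per-fold results, so every verse contributes to the evaluation without ever being used to train the model that labels it.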

The support vector machine (SVM) algorithm is typically used for learning a classification,
regression, or ranking function. The algorithm works by searching for a separating hyperplane that
separates the samples with a maximal margin [1]. The equation for the hyperplane is:


$$w^{T}x + b = 0 \qquad (6)$$
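
Once a hyperplane has been learned, classifying a sample reduces to checking which side of it the sample falls on. The sketch below shows only this decision step for a two-class case; the weight vector w and bias b are hypothetical values rather than learned parameters, and handling the paper's three labels would additionally require a multi-class scheme that is not shown.

```python
def decision_value(w, x, b):
    """Left-hand side of equation (6): the dot product of w and x, plus b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_side(w, x, b):
    """Return +1 for samples on or above the hyperplane, -1 for samples below it."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical parameters (not learned from the paper's data).
w = [1.0, -1.0]
b = -0.5

print(predict_side(w, [2.0, 0.0], b))
print(predict_side(w, [0.0, 1.0], b))
```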


The naive Bayes (NB) classifier is a simple probabilistic model based on the Bayes rule [1]. Given a class
C_i, the probability that a particular document d belongs to C_i is given as:


$$P(C_i \mid d) = \frac{P(d \mid C_i)\,P(C_i)}{P(d)} \qquad (7)$$
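
The Bayes rule above can be made concrete with a short sketch that turns class priors and document likelihoods into posteriors. The numbers are hypothetical, and the likelihoods P(d | C_i) are taken as given; in a naive Bayes classifier they would be estimated as a product of per-word probabilities.

```python
def posteriors(likelihoods, priors):
    """Apply the Bayes rule: P(C_i | d) = P(d | C_i) * P(C_i) / P(d)."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(d), by the law of total probability
    return {c: joint[c] / evidence for c in joint}


# Hypothetical priors and likelihoods for the three class labels.
priors = {"faith": 0.5, "worship": 0.3, "etiquettes": 0.2}
likelihoods = {"faith": 0.02, "worship": 0.01, "etiquettes": 0.01}

post = posteriors(likelihoods, priors)
predicted = max(post, key=post.get)  # class with the highest posterior
print(predicted, round(post[predicted], 3))
```

Because P(d) is the same for every class, the predicted label depends only on the numerator, which is why implementations often skip the normalization when only the most probable class is needed.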


The decision tree (J48) classifier is a simple representation for classifying data samples, as shown
in equation 8. Structurally, the algorithm functions like a tree where each internal node [28-29] is labeled
with an input feature x. In equation 8, the vector x is composed of the input features, while the variable Y
represents the target variable to be classified.

The nearest neighbor algorithm is one of the most widely applied classifiers in pattern recognition
[30]. The algorithm (also known as lazy learning) predicts the label of an instance by measuring the
distances between sample points using the Euclidean distance formula, as shown in equation 9.


$$d(x, x_i) = \sqrt{\sum_{j} (x_j - x_{ij})^2} \qquad (9)$$
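
A minimal sketch of nearest-neighbour prediction with the Euclidean distance is given below. The majority vote among the k closest samples, the value of k, and the two-dimensional toy vectors are illustrative assumptions; real inputs would be the weighted term vectors produced during preprocessing.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(query, samples, labels, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    ranked = sorted(zip(samples, labels), key=lambda pair: euclidean(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set.
samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["faith", "faith", "worship", "worship"]

print(knn_predict([1.0, 0.0], samples, labels, k=3))
```

No model is built in advance: all distance computations happen at prediction time, which is why the method is described as lazy learning.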


2.4. Evaluation Metrics 

Three of the most conventional metrics [1, 3, 11] are used in evaluating the performance of the ML
classifiers: accuracy, AUC, and the ROC curve. Combining these metrics provides a more accurate and
balanced performance evaluation [1]. Given a confusion matrix, accuracy is obtained using:


$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (10)$$

where TP is True Positive (instances correctly classified as Positive), TN is True Negative (instances
correctly classified as Negative), FP is False Positive (instances incorrectly classified as Positive), and FN is
False Negative (instances incorrectly classified as Negative).
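
Equation (10) translates directly into code. The sketch below also includes a label-based variant, since with three class labels accuracy is simply the share of correctly labeled instances; the counts and label sequences are hypothetical and are not the paper's results.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (10): correctly classified instances over all instances."""
    return (tp + tn) / (tp + fp + tn + fn)


def accuracy_from_labels(y_true, y_pred):
    """Multi-class form: fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical confusion-matrix counts.
print(accuracy(tp=40, tn=45, fp=5, fn=10))

# Hypothetical true and predicted labels for five verses.
y_true = ["faith", "worship", "etiquettes", "faith", "worship"]
y_pred = ["faith", "worship", "faith", "faith", "worship"]
print(accuracy_from_labels(y_true, y_pred))
```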


3. EXPERIMENTAL RESULTS AND ANALYSIS 

Implementation was carried out using four conventional machine learning classification algorithms
together with the information gain and chisquare feature selection algorithms. The experimental results
were evaluated and compared in terms of classification accuracy (ACC) and AUC. Furthermore, the ROC
curve was used to visualize the classifiers’ performance. Tables 1 to 3 show the classification results of the
classifiers. The classifiers were implemented in WEKA using all generated features (without feature
selection) as well as with feature selection.


Table 1. Classification Performance in Terms of Accuracy (ACC) and AUC (without Feature Selection)

ML            QTrans              QTaf                QTrans+Taf
Classifiers   ACC (%)   AUC       ACC (%)   AUC       ACC (%)   AUC
SVM           86.2      0.76      88.1      0.748     88.6      0.754
NB            87.4      0.884     89.7      0.904     90.7      0.925
J48           79.3      0.601     83.2      0.7       82.5      0.679
k-NN          81.1      0.679     81.4      0.499     83.3      0.519


Table 2. Classification Performance in Terms of Accuracy (ACC) and AUC (with Infogain FS Algorithm)

ML            QTrans              QTaf                QTrans+Taf
Classifiers   ACC (%)   AUC       ACC (%)   AUC       ACC (%)   AUC
SVM           88.4      0.793     91.4      0.859     90.2      0.832
NB            90.7      0.936     91.8      0.96      93.9      0.964
J48           84.1      0.757     85.1      0.75      83.7      0.677
k-NN          83.2      0.72      88.3      0.751     86.9      0.774


Table 3. Classification Performance in Terms of Accuracy (ACC) and AUC (with Chisquare FS Algorithm)

ML            QTrans              QTaf                QTrans+Taf
Classifiers   ACC (%)   AUC       ACC (%)   AUC       ACC (%)   AUC
SVM           88.6      0.765     91.4      0.859     90.2      0.832
NB            90.4      0.935     91.8      0.96      93.9      0.964
J48           84.4      0.769     84.6      0.75      83.7      0.689
k-NN          83.2      0.738     88.3      0.751     86.9      0.774


From the classification results, the machine learning algorithms achieved above 80% accuracy
across all experimental datasets, with one exception: the decision tree (J48) algorithm achieved the lowest
accuracy score of 79.3% on the QTrans dataset without feature selection. This could be a result of the high
dimensionality of the feature set. As previously noted, the curse of dimensionality associated with text data
influences the decisions of the classifiers. Every classifier achieved higher accuracy on every dataset when
either feature selection algorithm was applied (Tables 2 and 3 compared with Table 1). This again
establishes the significance of the feature selection process in data classification.

Exceptional among the classification algorithms is the naive Bayes (NB) classifier, which
consistently achieved the best classification results with the feature selection algorithms. The classifier
achieved the overall highest accuracy of 93.9% and AUC value of 0.964 with the QTrans+Taf dataset. The
nearest neighbour (k-NN) classifier achieved the lowest AUC score of 0.499, on the QTaf dataset without
feature selection. Again, this may be a result of the high dimensionality of the feature set. In addition,
classifiers are sensitive to the nature of the experimental data. This could be the reason why varying
classification results were obtained in the experimental work.

Furthermore, the classifiers’ performance was plotted for better visualization using the receiver
operating characteristic (ROC) curve. The ROC curves of the classification results with the QTrans+Taf
dataset are shown in Figures 2 and 3.

Figure 2. ROC curve of the ML classifiers using all features (without feature selection)

[ROC plot: curves for SVM, Naive Bayes, J48, and k-NN; x-axis: False Positive Rate]

Figure 3. ROC curve of the ML classifiers with InfoGain FS


4. CONCLUSION 

The classification of Quranic verses into predefined categories is an essential task in Quranic 
studies. In this paper, we presented an automated machine learning approach for classifying the input 
Quranic verses. To achieve this purpose, we employed four conventional machine learning algorithms: SVM, 
NB, J48, and k-NN. 

Features were generated from the Quranic textual data using standard machine learning techniques.
Furthermore, the InfoGain and chisquare FS methods were used to preprocess the input data in order to
reduce the curse of dimensionality. The preprocessed textual data, along with the label information, were
used to train the classifiers for the labeling task. Throughout the experiments, the conventional 10-fold
cross-validation method was employed.

Finally, the classifiers’ performances were evaluated and compared. The classifiers consistently
achieved above 80% accuracy, except for the J48 algorithm, which obtained the lowest accuracy score of
79.3% on the QTrans dataset. The naive Bayes (NB) classification algorithm achieved the overall highest
accuracy of 93.9% and AUC value of 0.964, while the k-NN classifier obtained the lowest AUC value of
0.499. Future work will explore other classification application domains.